Abstract


This  paper  presents  early  experience  with  a  typed  programming  language  and  compiler  for  run-time  code 
generation.  The  language  is  an  extension  of  the  SML  language  with  modal  operators,  based  on  the  λ□
language  of  Davies  and  Pfenning.  It  allows  programmers  to  specify  precisely,  through  types,  the  stages  of 
computation  in  a  program.  The  compiler  generates  target  code  that  makes  use  of  run-time  code  generation 
in  order  to  exploit  the  staging  information.  The  target  machine  is  currently  a  version  of  the  Categorial 
Abstract  Machine,  called  the  CCAM,  which  we  have  extended  with  facilities  for  run-time  code  generation. 
Using  this  approach,  the  programmer  is  able  to  express  the  staging  that  he  wants  to  the  compiler  directly.  It 
also  provides  a  typed  framework  in  which  to  verify  the  correctness  of  his  staging  intentions,  and  to  discuss  his 
staging  decisions  with  other  programmers.  Finally,  it  supports  in  a  natural  way  multiple  stages  of  run-time 
specialization,  so  that  dynamically  generated  code  can  be  used  to  generate  yet  further  specialized  code. 
This  paper  presents  an  overview  of  the  language,  with  several  examples  of  programs  that  illustrate  key 
concepts  and  programming  techniques.  Then,  it  discusses  the  CCAM  and  the  compilation  of  λ□  programs
into  CCAM  code.  Finally,  the  results  of  some  experiments  are  shown,  to  demonstrate  the  benefits  of  this 
style  of  run-time  code  generation  for  some  applications. 


1  Introduction 

In this paper, we present a programming language that allows programmers
to specify stages of computation in a program, along with an
implementation technique based on run-time code generation for
exploiting the staging.

A well-known technique for improving the performance of a computer
program is to separate its computations into distinct stages. If this is
done carefully, the results of early computations can be exploited in
later computations in a way that leads to faster execution. To achieve
this effect, programmers often stage their programs manually, using ad
hoc methods; there have also been some attempts to make such staging
transformations more systematic [16]. Another approach, used in partial
evaluation [8], is to automate the staging of programs according to a
programmer-supplied indication of which program inputs will be available
in the first stage of computation. This information is used to
synthesize a generating extension that will generate specialized code
for the late stages of the computation when given the first-stage
inputs. More recent work has extended the partial evaluation framework
to account for multiple computation stages [7].

In recent years, several researchers have studied the use of run-time
code generation (RTCG) to exploit staged computation [1, 3, 9, 10, 12].
One advantage of RTCG is that it opens the possibility of low-level code
optimizations (such as register allocation, instruction selection, loop
unrolling, array-bounds checking removal, and so on) that take advantage
of values that are not known until run time. Such optimizations cannot
normally be expressed by a source-to-source transformation.

In order to make use of RTCG, a compiler must first understand how the
program's computations are staged. Determining this staging information
is not a simple matter, however. While automatic binding-time analyses
have been used by partial evaluators and some compilers (notably the
Tempo system [1]), we are interested here in developing a programming
language that supports a systematic method for describing the
computation stages. Besides providing the programmer with full control
over when and where RTCG occurs, we believe the overall implementation
should also become much simpler since the complexity of a sophisticated
automatic analysis can be avoided.

The idea of using a programming notation for staging is far from new.
The backquote and antiquote notation of Lisp macros, for example,
provides an intuitive though highly error-prone approach to staged
computation. More recent annotation schemes used by RTCG systems include
that of `C [3] and Fabius [10]. These languages allow the programmer to
communicate his intentions to the compiler in a relatively
straightforward manner. Unfortunately, in the case of the Fabius system,
the annotation scheme is extremely simple, thus limiting the ability of
the programmer to express staging decisions. The difficulty in `C (and
Lisp), on the other hand, is that there is no direct way to ensure that
the staging behavior which the programmer specifies is correct: programs
can be (and are) written that will result in run-time errors. Such
errors include referencing a variable that is not yet available, and
referencing variables which are no longer available.

We propose that an extension of the SML language and type system can be
used as a clear and expressive notation for staged computation. Drawing
on previous work on the language λ□ [2], which is based on the modal
logic S4, and on the interpretation of this language for run-time code
generation described in [18], we present an implementation of a
prototype compiler for a version of the SML language (without modules)
that uses modal operators to specify early and late stages of a
program's computation. We then apply compilation techniques patterned
after those developed for the Fabius system [10] in order to compile
programs into code that performs RTCG according to the mode of each
subexpression in the program. We believe that using the modal source
language has the following advantages:

•  The programmer is able to express the staging that he wants to the
compiler directly, rather than indirectly through a heavyweight (and
usually unpredictable) analysis.

•  The programmer is given a framework which allows him to verify the
correctness of his staging intentions. A staging error becomes a type
error which can be analyzed and fixed, rather than simply resulting in a
slow or incorrect implementation. Furthermore, this framework is useful
for conceptualizing and discussing staging with other programmers
through typing specifications.

•  This approach is complementary to the use of automatic staging
through binding-time analysis. A compiler is free to augment the staging
requirements from a hand-staged program using any other means at its
disposal.

•  The language naturally handles situations in which more than two
stages are desired, such as Fabius-style multi-stage specialization
[11]. This arises, for example, when dynamically generated code can be
used to compute values that are used in the dynamic specialization of
yet more code.


In order to demonstrate these advantages, we have implemented a
prototype compiler for the Standard ML language (without modules),
extended with modal operators and types. The compiler generates code for
a version of the Categorial Abstract Machine [6], called the CCAM, which
is extended with a facility for emitting fresh code at run time.

We begin the paper with a brief introduction to the λ□ language, on
which our dialect of SML is based. Then, we give a series of program
examples, to show what it is like to write staged programs in our
language. These examples are chosen to illustrate different aspects of
staged computation, including Fabius-style multi-stage specialization.
Next, we present the CCAM, followed by a description of how λ□ programs
are compiled into CCAM code. Finally, we discuss some of the details of
the actual implementation of our compiler, and present some benchmark
results to show how staged programs can lead to better performance.


Types      A, B ::= A → B | □A

Terms      M, N ::= x | λx.M | M N
                  | u | code M | lift M
                  | let cogen u = M in N

Contexts   Γ ::= · | Γ, x : A | Γ, u : A

Figure 1: λ□ Syntax

2  The  Modal  Lambda-Calculus

We briefly introduce the language λ□, which is a simplification of the
explicit version of ML□ described in Davies and Pfenning [2]. Although
we present only λ□ here because of space considerations, the compilation
technique described in Section 5 extends easily to all core SML
constructs. Indeed, we have implemented a prototype compiler for most of
core ML extended with the modal constructs.

2.1  Syntax 

λ□ arises from the simply-typed λ-calculus by adding a new type
constructor □. Except for the addition of lift, it is related to the
modal logic S4 by an extension of the Curry-Howard isomorphism, where □A
means "A is necessarily true".

In our context, we think of □A as the type of generators for code of
type A. Generators are created with the code M construct. For example,
⊢ code (λx.x) : □(A → A) is a generator which, when invoked, generates
code for the identity function and then calls it. Figure 1 presents the
syntax of λ□. Note that there are two kinds of variables: value
variables bound by λ (denoted by x) and code variables bound by
let cogen (denoted by u). Their role is explained below.

To invoke a generator, one might expect a corresponding eval construct
of type (□A) → A. Such a function is in fact definable, but not a
suitable basis for the language. Instead we have a binding construct
let cogen u = M in N which expects a code generator M of type □A and
binds a code variable u (which we will sometimes call a modal variable).
However, even evaluation of let cogen u = M in N will not immediately
generate code.

Code will not be emitted until the modal variable u is encountered
during evaluation. For example,

⊢ λx. let cogen u = x in u : (□A) → A

is the function eval mentioned above, which invokes a generator to
create new code, and then evaluates that code.

Generation of code is postponed as long as possible so that the context
into which the code is emitted can be used for optimizations. For
example, the following is a higher-order function which takes generators
for two functions and creates a generator for their composition. The
result may be significantly more efficient than generating first and
then composing the resulting functions. Note that this function returns
a generator, but does not call the given generators or emit code itself.

⊢ λf.λg. let cogen f' = f in
         let cogen g' = g in code λx. f'(g'(x))
  : □(B → C) → □(A → B) → □(A → C)

Readers familiar with ML□ will notice that we have added the operator
lift, which obeys the rule that lift M has type □A if M has type A.
lift M evaluates M and returns a generator which just "quotes" the
resulting value. In contrast to code, this prohibits all optimizations
during code generation. As noted in Davies and Pfenning [2], lift is
definable in ML□ for base types, but its general form has no logical
foundation. Here we show that it nonetheless has a reasonable and useful
operational interpretation in the context of run-time code generation.
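As an informal illustration of the constructs above, the following Python sketch models a generator of type □A as a zero-argument function that returns source text for an expression of type A. This encoding, and the names `code`, `lift`, `eval_gen` and `compose`, are assumptions made for the sketch; the compiler described in this paper emits CCAM code rather than source text.

```python
# Hypothetical model: a generator (type □A) is a thunk returning Python
# source text for an expression of type A. Nothing is "emitted" until the
# thunk is called.

def code(src):
    """code M: wrap fixed source text as a generator."""
    return lambda: src

def lift(value):
    """lift M: quote an already-computed value as a generator."""
    return lambda: repr(value)

def eval_gen(gen):
    """eval : □A -> A. Invoke the generator, then run the generated code."""
    return eval(gen())

def compose(f_gen, g_gen):
    """□(B -> C) -> □(A -> B) -> □(A -> C).

    Returns a new generator without invoking f_gen or g_gen; their source
    is spliced together only when the result is itself invoked.
    """
    return lambda: f"(lambda x: ({f_gen()})(({g_gen()})(x)))"

inc = code("lambda n: n + 1")
dbl = code("lambda n: n * 2")
inc_after_dbl = compose(inc, dbl)   # still a generator, no code produced yet

print(eval_gen(inc_after_dbl)(5))   # 11
print(eval_gen(lift(42)))           # 42
```

Because `compose` splices source text rather than composing two already-built functions, whatever processes the combined text sees both function bodies at once, which is the opportunity for optimization that the paragraph on postponed generation refers to.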

2.2  Typing  Rules

The typing rules, presented in Figure 2, use two contexts: a modal
context Δ in which code variables are declared, and an ordinary context
Γ declaring value variables. The typing rules are the familiar ones for
the λ-calculus plus the rules for "let cogen", "code" and "lift".

The critical restriction which guarantees proper staging is that only
code variables (which occur in Δ) are permitted to occur free in
generators (underneath the code constructor), but no value variables.
The let cogen rule expresses that if we have a value which is a code
generator (and therefore of type □A), we can bind a code variable u of
type A which may be included in other code generators.

3  Programming  with  ML□

In order to give a feeling for what it is like to write ML□ programs,
we present several examples here.

3.1  Computing  the  Value  of  Polynomials

To start with a simple example, consider the following ML function which
evaluates a given polynomial for a given base. For this function, the
polynomial $a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n$ is represented as
the list $[a_0, a_1, a_2, \ldots, a_n]$.

type poly = int list;

val poly1 = [2,4,0,2333];

(* val evalPoly : int * poly -> int *)
fun evalPoly (x, nil) = 0
  | evalPoly (x, a::p) =
      a + (x * evalPoly (x, p));

If this function were called many times with the same polynomial but
different bases, it might be profitable to specialize it to the
particular polynomial, in effect synthesizing an ML function that
directly computes the polynomial rather than interpreting its list
representation. One way that we can accomplish this is by transforming
the code as follows.

fun specPoly (nil) =
      (fn x => 0)
  | specPoly (a::p) =
      let
        val polyp = specPoly p
      in
        fn x => a + (x * polyp x)
      end
x : A in Γ
--------------
Δ; Γ ⊢ x : A

Δ; Γ ⊢ M : A → B     Δ; Γ ⊢ N : A
-----------------------------------
Δ; Γ ⊢ M N : B

Δ; · ⊢ M : A
----------------------
Δ; Γ ⊢ code M : □A

Δ; (Γ, x : A) ⊢ M : B
------------------------
Δ; Γ ⊢ λx.M : A → B

u : A in Δ
--------------
Δ; Γ ⊢ u : A

Δ; Γ ⊢ M : A
----------------------
Δ; Γ ⊢ lift M : □A

Δ; Γ ⊢ M : □A     (Δ, u : A); Γ ⊢ N : B
-----------------------------------------
Δ; Γ ⊢ let cogen u = M in N : B

Figure 2: Typing rules for λ□


val poly1target = specPoly poly1;

While poly1target is an improvement over the more general evalPoly, it
is far from the fully specialized result we would like. Without support
from the compiler, common source-level optimizations are not performed,
such as unfolding of applications. Furthermore, code-level optimizations
cannot take advantage of the staging, for example in instruction
selection and register allocation. Therefore we rewrite specPoly as the
ML□ function compPoly.

(* val compPoly : poly -> (int -> int) $ *)
fun compPoly (nil) =
      code (fn x => 0)
  | compPoly (a::p) =
      let
        cogen f = compPoly p
        cogen a' = lift a
      in
        code (fn x => a' + (x * f x))
      end

val codeGenerator = compPoly poly1;
val mlPolyFun = eval codeGenerator;

Here the code operator marks the introduction of a code generator, and
the postfix type constructor $ is the □ type. Thus the compPoly function
takes a polynomial (a list of integer coefficients) and transforms it
into a code generator for a function that computes the value of the
polynomial for a particular base.
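The effect of compPoly can be approximated in Python by producing source text for the specialized function and compiling it with `eval`. This is only a sketch under the assumption that source text stands in for generated code; `comp_poly` and `eval_poly` are hypothetical names that mirror compPoly and evalPoly.

```python
def eval_poly(x, coeffs):
    """Unstaged interpreter, mirroring evalPoly."""
    if not coeffs:
        return 0
    return coeffs[0] + x * eval_poly(x, coeffs[1:])

def comp_poly(coeffs):
    """Early stage: return source text for an expression in x.

    The coefficient is quoted into the text (the role of lift), and the
    text for the rest of the polynomial is spliced in (the role of the
    code variable f).
    """
    if not coeffs:
        return "0"
    a, rest = coeffs[0], coeffs[1:]
    return f"({a!r} + x * {comp_poly(rest)})"

poly1 = [2, 4, 0, 2333]
source = "lambda x: " + comp_poly(poly1)
poly1_fun = eval(source)            # late stage: no list traversal remains

print(source)
print(poly1_fun(3), eval_poly(3, poly1))   # both print 63005
```

The generated text contains no reference to the list representation, so each later call does only the arithmetic for this particular polynomial.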


3.2  Libraries 

Suppose we were to build a library of useful functions. One possibility
afforded by ML□ is to install staged versions of the library routines,
so that client applications can benefit from dynamic specialization of
the library code.

Consider, for example, placing the compPoly function in a library. Then,
suppose we have a client application program:

(* val client : t1 -> (t2 -> t3) $ *)
fun client x =
  ... code (fn y =>
        ... compPoly (makePoly y) ...)

Even though the client program does not have access to the source code
of the compPoly library routine, it is still able to benefit from the
fact that it will perform RTCG on the polynomial computed by makePoly
(which presumably has type t2 -> poly).

This example also illustrates one way multi-stage specialization can be
achieved in our system. Note that the client program takes the argument
x and generates code for a t2 -> t3 function, and that it is this
dynamically generated code that invokes the compPoly function. Hence,
dynamically generated code can compute values which in turn are used to
generate yet more code. This kind of multi-stage specialization is
extremely difficult to achieve in standard partial evaluation, but falls
out naturally in our framework.




3.3  Packet  Filters 


A packet filter is a procedure invoked by an operating system kernel to
select network packets for delivery to a user-level process. To avoid
the overhead of a context switch on every packet, a packet filter must
be kernel resident. But kernel residence has a distinct disadvantage: it
can be difficult for a user-level process to specify precisely the types
of packets it wishes to receive, because packet selection criteria are
different for each application and can be quite complicated. As a
result, many useless packets may be delivered, with a consequent
degradation of performance. A commonly adopted solution to this problem
is to allow user-level processes to install a program that implements a
selection predicate into the kernel's address space [15, 14]. In order
to ensure that the selection predicate will not corrupt internal kernel
structures, the predicate must be expressed in a "safe" programming
language. Unfortunately, this approach has a substantial overhead, since
the safe programming language is typically implemented by a simple (and
therefore easy-to-trust) interpreter.

As demonstrated by several researchers, run-time code generation can
eliminate the overhead of interpretation by specializing the interpreter
to each packet filter program as it is installed. This has the effect of
compiling each packet filter into safe native code [5, 10, 13, 17]. To
demonstrate this idea in our language, consider the following excerpt of
the implementation of a simple interpreter for the BSD packet filter
language [14] in SML.

(* val evalpf : instruction array *
 *              int array *
 *              int * int * int -> int
 * Return 1 to select packet, 0 to reject,
 * ~1 if error
 *)
fun evalpf (filter, pkt, A, X, pc) =
  if pc >= length filter then ~1
  else case sub (filter, pc) of
      RET_A => A
    | RET_K(k) => k
    | LD_IND(i) =>
        let val k = X + i in
          if k >= length pkt then ~1
          else
            evalpf (filter, pkt,
                    sub(pkt,k), X, pc+1)
        end

The interpreter is given by a simple function called evalpf, which is
parameterized by the filter program, a network packet, and variables
that encode the machine state. The machine state includes an
accumulator, a scratch register, and a program counter.

In order to stage this function, it is straightforward to transform the
code so that the packet filter program and program counter are "early"
values, and the packet, accumulator, and scratch register are "late."
Then, the computations that depend only on the late values can be
generated dynamically by enclosing them in code constructors.

(* val bevalpf :
 *   (instruction array * int) ->
 *   (int * int * int array -> int) $
 *)
fun bevalpf (filter, pc) =
  if pc >= length filter then code (fn _ => ~1)
  else case sub (filter, pc) of
      RET_A => code (fn (A,X,pkt) => A)
    | RET_K(k) =>
        let cogen k' = lift k in
          code (fn _ => k')
        end
    | LD_IND(i) =>
        let cogen ev =
              bevalpf (filter, pc+1)
            cogen i' = lift i
        in
          code (fn (A,X,pkt) =>
            let val k = X + i' in
              if k >= length pkt
              then ~1
              else ev (sub(pkt,k),
                       X, pkt)
            end)
        end

When applied to a filter program and program counter, the result of
bevalpf is the CCAM code of a function that takes a machine state and
packet, and computes the result of the packet filter on that packet and
state. Later, in Section 6, we show that the improvement in execution
time for a typical BPF packet filter is substantial.
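The staging of the interpreter can be sketched in Python for the three instructions shown in the excerpt. The tuple encoding of instructions and the names `gen_filter` and `compile_filter` are assumptions of this sketch, and Python source text again stands in for generated code.

```python
def gen_filter(prog, pc=0):
    """Early stage: walk the filter program once and return source text
    for an expression over the late values A, X and pkt."""
    if pc >= len(prog):
        return "-1"                      # ran off the end: error
    op = prog[pc]
    if op[0] == "RET_A":
        return "A"
    if op[0] == "RET_K":
        return repr(op[1])               # constant quoted into the code
    if op[0] == "LD_IND":
        i = op[1]
        rest = gen_filter(prog, pc + 1)  # code for the remaining program
        # Rebind A to pkt[X + i] for the rest of the filter, guarded by a
        # bounds check that is evaluated before the packet is indexed.
        return (f"((lambda A: {rest})(pkt[X + {i!r}]) "
                f"if 0 <= X + {i!r} < len(pkt) else -1)")
    raise ValueError(f"unknown instruction: {op!r}")

def compile_filter(prog):
    """Late-stage function of (A, X, pkt) specialized to one program."""
    return eval("lambda A, X, pkt: " + gen_filter(prog))

prog = [("LD_IND", 1), ("RET_A",)]
run = compile_filter(prog)
print(run(0, 1, [10, 20, 30]))   # 30: loads pkt[1 + 1] and returns it
print(run(0, 5, [10, 20, 30]))   # -1: index out of range
```

After `compile_filter` returns, neither the instruction array nor the program counter is consulted again; dispatch on the instruction has been done once, at specialization time.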

3.4  Memoizing  ML□  Programs

Since specializing programs at run time typically involves additional
expense, a central assumption of this approach is that the specialized
code generated will often be used many times. This happens naturally in
some programs. If, for example, a program specializes a section of code
and then immediately, in the same scope in the code, uses that
specialized code many times, it is easy to bind the generated code to a
variable and use that variable, thereby avoiding regeneration of the
code. In other situations we must work harder to get this sort of
"memoizing" behavior.

Consider  the  following  specializing  function  to 
compute  the  value  of  an  integer  raised  to  the  power 
of  e. 

(*  val  codePower  :  int  ->  (int  ->  int)$  *) 
fun  codePower  e  = 
if  e  =  0  then 

code  (fn  _  =>  1) 

else 

let 

cogen  p  =  codePower  (e  -  1) 
in 

code  (fn  b  =>  b  *  (p  b)) 

end 

If this function is used to compute powers in two or more sections of
the same program, it is possible that the same code will be generated
and regenerated many times, making the resulting program slower rather
than faster. We must carefully arrange to have generated programs saved
for future use in situations where we think they are likely to be needed
again. Fortunately, we can bind up this functionality with the function
itself.

(*
  specCode : (int, int -> int) table
  lookup   : ('a, 'b) table * 'a -> 'b option
  add      : ('a, 'b) table * ('a * 'b) -> unit
*)

(* memoPower1 : int -> int -> int *)
fun memoPower1 e =
  case lookup (specCode, e) of
      NONE =>
        let
          cogen p = codePower e
          val p' = p
        in
          add (specCode, (e, p'));
          p'
        end
    | SOME p => p;

This function simply embeds the codePower function within a wrapper that
checks a hash table to determine whether or not a particular specialized
version of the function exists. If it does, then it is returned, without
need for further work. Otherwise, codePower is called, and a new
function is generated, stored in the table, and returned.

While memoPower1 saves generated code, so that it will benefit from past
computations on the same exponent, it does nothing to speed up the
computation for two different exponents, even though they may share
subcomputations.

memoPower2 goes even further than memoPower1. It saves the result of
each internal call to the power function in a table, genExts, of
generating extensions. Then if it is called to compute, for instance,
$n^{65}$ and then $m^{34}$, it won't have to do any additional work to
make a generating extension for the second call.

(*
  specCode : (int, int -> int) table
  genExts  : (int, (int -> int)$) table
  lookup   : ('a, 'b) table * 'a -> 'b option
  add      : ('a, 'b) table * ('a * 'b) -> unit
*)

(* memoPower2 : int -> int -> int *)
fun memoPower2 e =
  (case lookup (specCode, e) of
       NONE =>
         let
           cogen p = mPower e
           val p' = p
         in
           add (specCode, (e, p'));
           p'
         end
     | SOME p => p)

(* mPower : int -> (int -> int)$ *)
and mPower e =
  (case lookup (genExts, e) of
       NONE =>
         let
           val p = bPower e
         in
           (add (genExts, (e, p));
            p)
         end
     | SOME p => p)

(* bPower : int -> (int -> int)$ *)
and bPower e =
  if e = 0 then
    code (fn _ => 1)
  else
    let
      cogen p = mPower (e - 1)
    in
      code (fn b => b * (p b))
    end;

While specifying memoization behavior by hand in this fashion may be
excessively tedious in some cases, it does allow the programmer to very
carefully control what and how memoization will occur. Furthermore,
generic memoization routines could be written that can easily
accommodate most common memoization needs.
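The two-table scheme of memoPower2 can be sketched in Python with ordinary dictionaries. Here `spec_code` and `gen_exts` play the roles of specCode and genExts; storing source text in `gen_exts` is an assumption of the sketch, standing in for stored generating extensions.

```python
spec_code = {}   # exponent -> compiled function  (role of specCode)
gen_exts = {}    # exponent -> generated source   (role of genExts)

def b_power(e):
    """Build the source for b ** e, reusing cached text for e - 1."""
    if e == 0:
        return "1"
    return f"b * ({m_power(e - 1)})"

def m_power(e):
    """Memoizing wrapper around b_power."""
    if e not in gen_exts:
        gen_exts[e] = b_power(e)
    return gen_exts[e]

def memo_power(e):
    """Return a function computing b ** e, compiling it at most once."""
    if e not in spec_code:
        spec_code[e] = eval("lambda b: " + m_power(e))
    return spec_code[e]

print(memo_power(5)(2))        # 32
cached = len(gen_exts)         # entries for exponents 0 through 5
print(memo_power(3)(10))       # 1000, built from already cached text
print(len(gen_exts) == cached) # True: no new generating work was needed
```

A smaller exponent requested after a larger one finds its entry already present in `gen_exts`, which is the sharing of subcomputations that memoPower1 alone does not provide.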

4  The  CCAM 

In this section we present the CCAM, an ad hoc extension of the CAM [6]
which provides facilities for run-time code generation and which we use
as the target of the compiler detailed in the next section.

4.1  Fabius  and  Run-Time  Code  Generation

The Fabius compiler [10] delivers dramatic speedups over conventional
compilers for some programs by compiling selected functions in its input
to generating extensions. Using values obtained in earlier computations,
these generating extensions create code specialized to perform later
computations. While several different schemes for run-time code
generation have been used in other systems [4, 3, 13, 12, 1], Fabius is
able to achieve a remarkably low
instruction-executed-to-instruction-specialized ratio by a unique
combination of features.

•  Generating extensions produced by Fabius never manipulate
source-level terms at run time. Instead machine language programs are
synthesized directly from machine language programs. Fabius is not
unique in this respect: the Synthesis kernel [13, 12] and Tempo compiler
[1] also share this property.

•  Fabius encodes terms to be specialized directly into the instruction
stream, usually in the form of immediate operands to instructions. This
is in contrast to systems which copy templates and fill in holes at run
time, such as Tempo and the Synthesis kernel. Instruction stream
encoding allows Fabius to be very flexible about the kinds of
specialization it can arrange to have performed at run time.

•  Programs compiled by Fabius allow dynamic staging of code, i.e. the
number of times that a program specializes itself may be dependent on
some value that will not be known until run time. This is necessary to
fully exploit the specialization opportunities in many situations. For
example, many programs have a top-level loop which waits to receive
input in some form, and then takes appropriate action. Conventional
off-line partial evaluation will fail to serve such a program well
because even multi-level partial evaluation has no way to specialize on
each of the variably many inputs.

4.2  An  Abstract  Machine  for  Run-Time  Code  Generation

While developing the compilation technique for ML□ we wanted to compile
programs to include generating extensions that have the same three
properties that we list above for Fabius. We also thought it desirable
to abstract away as much as possible from the details of individual
architectures. However, since we want to create generating extensions
that do not manipulate source-level terms, but instead generate machine
instructions directly, details of the machine to which we compile must
find their way into our translation scheme. For this reason we developed
the CCAM. We believe it to be a reasonable formalism that provides the
capabilities that we need, while hiding details about individual
architectures and instruction sets.

The primary novelty of the CCAM is the emit(i) instruction. It is
intended to represent the series of instructions required on a real
computer to produce the instruction i in a specialized program. As will
be made more clear below, the CCAM encodes a generating extension as a
series of emit(i) instructions. This is designed to emulate the
technique of run-time instruction encoding used in the Fabius compiler.

As an example of this form of code generation, consider the instruction
emit(add). If this instruction were compiled to real machine
instructions it might be represented by three instructions, one which
contained the lower 16 bits of the add instruction in an immediate load
low instruction, one which contained the upper 16 bits, and finally one
to write the assembled instruction to memory. A more sophisticated
specialization system might compile emit(add) to a series of
instructions which would test the values of the operands of the add
instruction at specialization time (if they are available) and eliminate
the instruction altogether if either one is 0.
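The three-step encoding of emit(add) can be illustrated with a few lines of Python arithmetic. The 32-bit word `ADD_WORD` is an arbitrary placeholder standing in for the encoding of an add instruction, and the list `arena` stands in for the memory that receives emitted code; both are assumptions of the sketch rather than details of any particular architecture.

```python
ADD_WORD = 0x00221820   # hypothetical 32-bit encoding standing in for "add"

def emit(word, arena):
    """Materialize `word` the way a three-instruction sequence might:
    the two halves travel as 16-bit immediates, then the result is stored."""
    hi = (word >> 16) & 0xFFFF   # immediate carried by the "load high" step
    lo = word & 0xFFFF           # immediate carried by the "load low" step
    reg = hi << 16               # step 1: place the upper 16 bits
    reg |= lo                    # step 2: fill in the lower 16 bits
    arena.append(reg)            # step 3: write the assembled word to memory

arena = []
emit(ADD_WORD, arena)
print(hex(arena[0]))             # 0x221820
```

Each emitted instruction therefore costs several executed instructions, which is the ratio that the discussion of Fabius above is concerned with keeping low.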

That we wish to produce multi-staged programs is a potential problem for
our abstract machine. If we encode generating extensions with emit(i)
instructions, must a program which contains a generating extension which
produces code which is itself a generating extension give rise to
instructions of the form emit(emit(i))? If so, then a chain of n
generating extensions could lead to n nested emits.

Observe, however, that on a machine with fixed-length
instructions there is a limited amount of space available
for immediate operands, and so if instructions to be emitted
are embedded in instructions in the instruction stream, it
will take at least two instructions to represent one emitted
instruction. Furthermore, it could take 2^n instructions to
represent the n-fold nesting emit(emit(... emit(i) ...)).
For this reason, nested emits are not allowed on the CCAM,
and our compilation scheme needs to take special steps in
order to allow multi-level specialization. We show how to do
this in section 5.
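The growth argument can be checked with a few lines of Python. The sketch assumes exactly two host instructions per emitted instruction (the text gives this as a lower bound), and host_instructions is a hypothetical helper name.

```python
def host_instructions(depth, per_level=2):
    """Host instructions needed for one instruction under `depth` nested emits.

    Each level of emit is assumed to cost `per_level` host instructions per
    instruction it emits, so the cost multiplies at every level.
    """
    count = 1
    for _ in range(depth):
        count *= per_level
    return count


costs = [host_instructions(n) for n in range(5)]
```

Under this assumption the cost doubles with every additional level of nesting, which is why the CCAM rules out nested emits.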


4.3  Instructions 

The CCAM has the usual seven instructions associated with
the CAM, and five more for code generation. The emit(i)
instruction, which has already been described, creates the
instruction i in a new, dynamically created code sequence,
called an arena. The lift instruction residualizes a value
into an arena, arena creates a new arena, and call inserts
dynamically generated code from an arena into the current
instruction stream. Finally, merge merges two arenas by
inserting one as a function in the other.


Simple Inst     i     ::=  id | fst | snd | push
                        |  swap | cons | app
                        |  'v | lift | arena
                        |  merge | call
Composite Inst  I     ::=  i | emit(i) | Cur(P)
Values          v, u  ::=  (v, u) | [v : P] | B | ()
Code Blocks     B     ::=  {P}
Sequences       P     ::=  • | I; P
Stacks          S     ::=  • | v :: S


4.4  Transitions 

A configuration, (S, P), of the CCAM consists of a stack
of values and an instruction sequence representing the
current instruction stream. We will routinely omit the
final • on stacks and instruction sequences. We use P'@P to
represent the sequence obtained by appending the sequence P
to the sequence P'. Figure 3 lists the transitions of the
CCAM.
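To make the transition system concrete, here is a minimal interpreter sketch in Python. It is an illustration, not the authors' simulator. The encoding is an assumption: simple instructions are strings, 'v is ("quote", v), emit(i) is ("emit", i), Cur(P) is ("cur", P), pairs are Python 2-tuples, and Closure and Arena are hypothetical class names. The arena case (replace the top of the stack by an empty arena) and the argument order of merge are inferred from how the compilation rules use them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Closure:      # the value [v : P]
    env: object
    code: tuple


@dataclass(frozen=True)
class Arena:        # the code block {P}
    code: tuple = ()


def run(stack, program):
    """Run `program` on `stack` (a list whose first element is the top)."""
    stack, prog = list(stack), list(program)
    while prog:
        ins = prog.pop(0)
        top = stack[0]
        if ins == "id":
            pass
        elif ins == "fst":
            stack[0] = top[0]
        elif ins == "snd":
            stack[0] = top[1]
        elif ins == "push":
            stack.insert(0, top)
        elif ins == "swap":
            stack[0], stack[1] = stack[1], stack[0]
        elif ins == "cons":                  # v :: u :: S  becomes  (u, v) :: S
            v = stack.pop(0)
            stack[0] = (stack[0], v)
        elif ins == "app":                   # enter the closure's code
            closure, arg = top
            stack[0] = (closure.env, arg)
            prog = list(closure.code) + prog
        elif ins == "arena":                 # assumed: fresh empty arena
            stack[0] = Arena()
        elif ins == "lift":                  # residualize v as a quote
            v, arena = top
            stack[0] = (v, Arena(arena.code + (("quote", v),)))
        elif ins == "merge":                 # assumed shape: ({P'}, (v, {P''}))
            body, (v, arena) = top
            stack[0] = (v, Arena(arena.code + (("cur", body.code),)))
        elif ins == "call":                  # splice generated code into the stream
            v, arena = top
            stack[0] = v
            prog = list(arena.code) + prog
        elif ins[0] == "quote":
            stack[0] = ins[1]
        elif ins[0] == "cur":
            stack[0] = Closure(top, tuple(ins[1]))
        elif ins[0] == "emit":               # append one instruction to the arena
            v, arena = top
            stack[0] = (v, Arena(arena.code + (ins[1],)))
        else:
            raise ValueError(f"unknown instruction: {ins!r}")
    return stack


# Build the generating extension Cur(lift) for the value 5, bind it in the
# environment, then apply it to a new arena and call the generated code.
program = ["push", ("quote", 5), ("cur", ["lift"]), "cons",
           "push", "snd", "swap", "arena", "cons", "app", "call"]
result = run([()], program)   # result == [5]
```

The example program generates the one-instruction code sequence '5 at run time and then executes it, leaving 5 on the stack.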

5  Compilation 

The translation from λ□ to CCAM code is detailed in this
section. The translation is divided into two parts: the
translation of code that is not initially inside a code
generator, and the translation of code generators. These
two translations are written ⟦M⟧_E and ⟦M⟧_E^LE,
respectively.

⟦M⟧_E denotes the translation of non-code-generating code
M in a context E, which simply describes the location of
variables in the run-time environment. Variable contexts
are built from variables and the empty context as follows:

Variable Environments  E, LE  ::=  • | E, u

To save space and for convenience we will often write
emit(i) as i, and the pairing operators push, swap, and
cons as ‘⟨’, ‘,’, and ‘⟩’, respectively.

The rules for translating applications, non-modal
variables, and abstractions in a non-code-generating
context are the same as those in [6]. We compile code
expressions to generating extensions, which are functions
from arenas to arenas. An extension emits its code into its
argument arena and returns the transformed arena. A modal
variable must select from the environment the generating
extension to which it is bound, apply the extension to a new
arena, and finally jump to the newly created code. Thus, it
is when modal variables are referenced outside of code
constructs that code generation actually occurs. Finally,
the let cogen construct translates to code that augments the
environment with the result of the bound expression and then
executes the body of the expression.
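The non-code-generating translation described in this paragraph can be sketched as a small Python function. This is a sketch under assumptions, not the authors' compiler: terms are tagged tuples, contexts are Python lists with the innermost binding last, and the instruction encoding (strings, with ("cur", P) for Cur(P)) is hypothetical. The code construct is left out because it needs the second translation, and the let cogen case follows one reading of Figure 4 (push; M; cons; N).

```python
def get(name, env):
    """Access path for `name` in context `env` (innermost binding last)."""
    if not env:
        raise KeyError(name)
    if env[-1] == name:
        return ["snd"]
    return ["fst"] + get(name, env[:-1])


def pair(first, second):
    """CAM pairing <first, second>: push; first; swap; second; cons."""
    return ["push", *first, "swap", *second, "cons"]


def compile_term(term, env):
    """Translate a term outside any code generator into CCAM instructions."""
    tag = term[0]
    if tag == "var":                      # ordinary variable x
        return get(term[1], env)
    if tag == "lam":                      # abstraction: Cur of the body
        _, x, body = term
        return [("cur", compile_term(body, env + [x]))]
    if tag == "app":                      # application: <M, N>; app
        _, fn, arg = term
        return pair(compile_term(fn, env), compile_term(arg, env)) + ["app"]
    if tag == "mvar":                     # modal variable: generate, then jump
        return pair(get(term[1], env), ["arena"]) + ["app", "call"]
    if tag == "lift":                     # lift M: M; Cur(lift)
        return compile_term(term[1], env) + [("cur", ["lift"])]
    if tag == "letcogen":                 # extend the environment, run the body
        _, u, bound, body = term
        return ["push", *compile_term(bound, env), "cons",
                *compile_term(body, env + [u])]
    raise ValueError(f"unsupported term: {tag}")


identity = compile_term(("lam", "x", ("var", "x")), [])
use_u = compile_term(("mvar", "u"), ["u"])
```

Here use_u is the instruction sequence that looks up the generating extension bound to u, applies it to a new arena, and calls the result, which is the point at which code generation actually happens.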


The ⟦M⟧_E^LE relation compiles a λ□ term into a generating
extension. It uses two contexts: an “early” context E,
which holds the location of variables in the environment
from all stages, and a “late” context LE, which is really
just a pointer into the early context that marks the
division between variables available at generation time and
those that will only be available later, when the generated
code is run. The translation rules for applications and
non-modal variables underneath code constructors are similar
to those for their non-code-generating relatives, except
that the instructions are buried under emit() instructions.
The abstraction rule, on the other hand, is complicated
considerably by the fact that the argument of a Cur is a
sequence of instructions, and instructions must be emitted
individually. This is the reason for the merge instruction.
It enables us to emit code to a new arena and then treat
that code as the body of a function. Implemented on a real
computer, this would correspond to the fact that the text of
a function is typically stored in a separate area, and a
function call involves jumping to the location of the
function.

Translating modal variables under code constructors
depends on where the variable is bound. If it is bound
under the same code constructor in which the variable finds
itself, then there is no generating extension yet available
in the environment for it, and so it must be rebuilt as a
reference to its binder. If, on the other hand, it is bound
outside the code constructor, then it should be applied to
the current arena, thereby effectively substituting its code
into the current code.

The primary difficulty in the compilation is avoiding
nested emits. We achieve this by arranging to have
generating extensions specialize all of the code that they
contain, except the code for other generating extensions.
This results in a rather complicated-looking case in the
compilation for code expressions under other code
constructors. Essentially, the code arranges to have a
closure containing the body of the code expression inserted
into the arena. This closure is explicitly applied to the
“late” environment so that it can access all the variables
bound within it.

The boxed lift and let cogen rules are mostly emitted
versions of the unboxed forms, except that the lift rule
needs to go through the same contortions as abstractions to
insert a Cur into the arena.


Stack                      Program          Stack                        Program

S                          id; P        =>  S                            P
(v, u) :: S                fst; P       =>  v :: S                       P
(v, u) :: S                snd; P       =>  u :: S                       P
v :: S                     'u; P        =>  u :: S                       P
v :: S                     push; P      =>  v :: v :: S                  P
v :: u :: S                swap; P      =>  u :: v :: S                  P
v :: u :: S                cons; P      =>  (u, v) :: S                  P
v :: S                     Cur(P'); P   =>  [v : P'] :: S                P
([v : P'], u) :: S         app; P       =>  (v, u) :: S                  P'@P
(v, {P'}) :: S             emit(i); P   =>  (v, {P'@(i; •)}) :: S        P
(v, {P'}) :: S             lift; P      =>  (v, {P'@('v; •)}) :: S       P
v :: S                     arena; P     =>  {•} :: S                     P
({P'}, (v, {P''})) :: S    merge; P     =>  (v, {P''; Cur(P')}) :: S     P
(v, {P'}) :: S             call; P      =>  v :: S                       P'@P

Figure 3: Transitions of the CCAM


6  ML□  compiler


We have implemented a prototype ML□ compiler for a large
subset of core ML, including datatypes, reference cells, and
arrays, extended with the modal constructs. All of the
programs presented in this paper are working programs
compilable by our compiler. The compiler generates code for
the CCAM extended with support for conditionals, recursion,
and various base types.

In addition, we have built a CCAM simulator on which to run
the output of our compiler. While CCAM instructions are
rather abstract compared to native machine code, we can
still observe the benefits of specialization by counting
reduction steps in CCAM programs.


Computation                          Reductions

evalpf on first telnet packet              9163
evalpf on nth telnet packet                9163
bevalpf on first telnet packet            11984
bevalpf on nth telnet packet               1104
evalPoly (47,polyl)                         807
specPoly polyl                              443
polylTarget 47                              175
compPoly polyl                              553
eval codeGenerator                          200
mlPolyFun 47                                 74

Table 1: Reduction steps on the CCAM for various
functions in the text
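The packet-filter rows of Table 1 can be turned into a break-even estimate with a short Python calculation. The sketch assumes that the count for the first packet includes the one-time cost of run-time code generation and that every later packet costs the "nth packet" count; cumulative and break_even are hypothetical helper names.

```python
def cumulative(first, later, uses):
    """Total reductions for `uses` calls when the first call costs `first`."""
    return first + later * (uses - 1)


def break_even(unstaged, staged_first, staged_later):
    """Smallest number of calls at which the staged version is no slower."""
    if staged_later >= unstaged and staged_first > unstaged:
        raise ValueError("the staged version never catches up")
    uses = 1
    while cumulative(staged_first, staged_later, uses) > unstaged * uses:
        uses += 1
    return uses


# Reduction counts taken from Table 1.
EVALPF = 9163          # unstaged filter, every packet
BEVALPF_FIRST = 11984  # staged filter, first packet
BEVALPF_NTH = 1104     # staged filter, later packets

packets = break_even(EVALPF, BEVALPF_FIRST, BEVALPF_NTH)
steady_state_ratio = EVALPF / BEVALPF_NTH
```

Under these assumptions the staged filter pays for itself by the second packet (11984 + 1104 = 13088 reductions against 2 x 9163 = 18326), and each later packet takes roughly 8.3 times fewer reductions.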


7  Conclusion 

get(a, (E, a))                 =  snd
get(a, (E, b))                 =  fst; get(a, E)

⟦x⟧_E                          =  get(x, E)
⟦λx.M⟧_E                       =  Cur(⟦M⟧_(E,x))
⟦M N⟧_E                        =  ⟨⟦M⟧_E, ⟦N⟧_E⟩; app
⟦u⟧_E                          =  ⟨get(u, E), arena⟩; app; call
⟦code M⟧_E                     =  Cur(⟦M⟧_E^•)
⟦lift M⟧_E                     =  ⟦M⟧_E; Cur(lift)
⟦let cogen u = M in N⟧_E       =  ⟨⟦M⟧_E⟩; ⟦N⟧_(E,u)

⟦x⟧_E^LE                       =  get(x, LE)
⟦λx.M⟧_E^LE                    =  ⟨⟨fst, arena⟩; ⟦M⟧_(E,x)^(LE,x); snd, id⟩; merge
⟦M N⟧_E^LE                     =  ⟨⟦M⟧_E^LE, ⟦N⟧_E^LE⟩; app
⟦u⟧_E^LE                       =  ⟨fst; get(u, LE), arena⟩; app; call      if u is in LE
                               =  ⟨fst, ⟨fst; get(u, E), snd⟩; app; snd⟩   otherwise
⟦code M⟧_E^LE                  =  ⟨⟨fst, ⟨'(); Cur(snd; Cur(⟦M⟧_E^•)), snd⟩; lift; snd⟩, id⟩; app
⟦lift M⟧_E^LE                  =  ⟦M⟧_E^LE; ⟨⟨fst, arena⟩; lift; snd, id⟩; merge
⟦let cogen u = M in N⟧_E^LE    =  ⟨⟦M⟧_E^LE⟩; ⟦N⟧_(E,u)^(LE,u)

Figure 4: Compilation rules


We have designed and implemented a compiler for the
language ML□ which compiles code expressions into code
generators. The compiler targets the CCAM, an extension of
the CAM carefully designed to emulate the style of run-time
code generation first provided by the Fabius compiler.

In our early experience with the ML□ language and our
compiler, we have been able to express precisely the staging
of computations necessary to take best advantage of the
run-time code generation facilities of the CCAM. This
experience is an early indication that a language that
provides explicit control over staging decisions can be a
practical way to improve the performance of programs.